Sample tk data #73

Merged: lizgzil merged 11 commits into dev from sample_tk_data on Nov 5, 2021
Conversation

lizgzil (Contributor) commented Oct 26, 2021


Fixing #68

  • find and save a sample of the Textkernel data (get_tk_sample.py)
  • update the predict sentence class script so it can use data from a pre-found sample (rather than sampling within the script); it still works with the old method too
  • update the READMEs

This samples 5 million job adverts at random across files. So when the skill sentence predictions are output there are more of them in total, but each output file contains less data. Previously, 100 random files were selected and the first 10k job adverts were processed; now there is data from 647 of the files, with a random selection from each (which is fewer than 10k).
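For context, this is roughly what sampling a fixed number of adverts across many files can look like. This is a minimal sketch with hypothetical names (file_job_counts would be built in a first pass over the files), not the actual code in get_tk_sample.py:

import random

random.seed(42)  # illustrative seed, not necessarily the one used in the PR

# file_job_counts: {file_name: number of job adverts in that file}
all_ids = [
    (file_name, advert_index)
    for file_name, n_adverts in file_job_counts.items()
    for advert_index in range(n_adverts)
]
sample = random.sample(all_ids, 5_000_000)

# Group the sampled indices by file, so each file can then be processed independently
sample_by_file = {}
for file_name, advert_index in sample:
    sample_by_file.setdefault(file_name, []).append(advert_index)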

Timing

Each file goes through the same algorithm independently and takes roughly 15-20 minutes. These are the timings for each step of the algorithm for 1 data file out of the 647:

2021-10-26 17:34:02,278 - __main__ - INFO - Loading data from inputs/data/textkernel-files/historical/2020/2020-03-11/jobs_2.110.jsonl.gz ...
2021-10-26 17:34:18,424 - __main__ - INFO - Splitting sentences ...
2021-10-26 17:39:30,769 - __main__ - INFO - Splitting sentences took 312.3075065612793 seconds
2021-10-26 17:39:30,795 - __main__ - INFO - Processing sentences took 312.3337023258209 seconds
2021-10-26 17:39:30,795 - __main__ - INFO - Transforming skill sentences ...
Getting embeddings for 168906 texts ...
.. with multiprocessing
2021-10-26 17:39:30,795 - sentence_transformers.SentenceTransformer - INFO - Start multi-process pool on devices: cuda:0
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/ubuntu/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package punkt to /home/ubuntu/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
2021-10-26 17:39:33,602 - sentence_transformers.SentenceTransformer - INFO - Chunk data into packages of size 5000
Took 643.1803483963013 seconds
2021-10-26 17:53:31,097 - __main__ - INFO - Chunking up sentences ...
2021-10-26 17:53:31,248 - __main__ - INFO - Chunking up sentences into 169 chunks took 0.1508171558380127 seconds
2021-10-26 17:53:31,248 - __main__ - INFO - Predicting skill sentences ...
2021-10-26 17:54:07,619 - __main__ - INFO - Predicting on 168906 sentences took 36.37088871002197 seconds
2021-10-26 17:54:07,619 - __main__ - INFO - Combining data for output ...
2021-10-26 17:54:07,675 - __main__ - INFO - Combining output took 0.05558419227600098 seconds
2021-10-26 17:54:07,675 - __main__ - INFO - Saving data to outputs/sentence_classifier/data/skill_sentences/2021.10.26/textkernel-files/historical/2020/2020-03-11/jobs_2.110_2021.08.16.json ...
2021-10-26 17:54:08,017 - skills_taxonomy_v2.getters.s3_data - INFO - Saved to s3://skills-taxonomy-v2 + outputs/sentence_classifier/data/skill_sentences/2021.10.26/textkernel-files/historical/2020/2020-03-11/jobs_2.110_2021.08.16.json ...

9871 job adverts were in the "historical/2020/2020-03-11/jobs_2.110.jsonl.gz" file.

We would expect an average of 5,000,000/647 ≈ 7,728 sampled job adverts per file, so this file seems to have a particularly large share of the sample in it.

There will be different numbers of sentences in each job advert, but scaling that up means 20 mins × 647 files ≈ 9 days, or 20 mins × (5,000,000/9,871) ≈ 7 days.
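As a back-of-envelope check of those numbers (assuming ~20 minutes per file):

n_files = 647
mins_per_file = 20
sample_size = 5_000_000
adverts_in_this_file = 9_871

print(sample_size / n_files)  # ~7728 sampled adverts expected per file
print(mins_per_file * n_files / (60 * 24))  # ~9 days if every file takes 20 minutes
print(mins_per_file * (sample_size / adverts_in_this_file) / (60 * 24))  # ~7 days scaling by this file's advert count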

Target areas for speeding up!

  1. Around half of the processing time is spent transforming the sentences using the BERT model.

The biggest sticking point is transforming the sentences using the pre-trained BERT model (even when using multiprocessing), i.e. in this function, which is called in this PR here. So could this step be done better in order to speed up that area of the pipeline?

  2. Around a quarter of the processing time is spent splitting the sentences.

The second biggest time lag is splitting the text up into sentences. This is done via the split_sentence function, which in this PR is called here:

with Pool(4) as pool:  # 4 cpus
    partial_split_sentence = partial(split_sentence, nlp=nlp, min_length=30)
    split_sentence_pool_output = pool.map(partial_split_sentence, data)
logger.info(f"Splitting sentences took {time.time() - start_time} seconds")
lizgzil (Contributor, Author):
this takes 5 minutes

lizgzil (Contributor, Author):
now down to <5 seconds thanks to comments by @jaklinger


if sentences:
    logger.info(f"Transforming skill sentences ...")
    sentences_vec = sent_classifier.transform(sentences)
lizgzil (Contributor, Author):

this takes about 10 minutes
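The "Start multi-process pool" and "Chunk data into packages of size 5000" log lines above come from the sentence-transformers multi-process encoding API, which looks roughly like the sketch below (the model name and parameter values here are illustrative, not necessarily what this repo uses); batch_size and chunk_size are the main knobs to experiment with:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model name
model.max_seq_length = 512

# Spin up one worker process per available GPU (or several CPU workers)
pool = model.start_multi_process_pool()
# Sentences are sent to the workers in chunks of `chunk_size` and encoded in batches of `batch_size`
embeddings = model.encode_multi_process(sentences, pool, batch_size=32, chunk_size=5000)
model.stop_multi_process_pool(pool)

On a single GPU, a plain model.encode(sentences, batch_size=...) call with a larger batch size may be just as fast, since the multi-process pool mainly pays off with several devices.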

jaklinger commented Oct 27, 2021

First note (another coming I think, but have a meeting now!): nltk's sent_tokenize is 10-100x faster than nlp(...). This should bring us from 7-9 days to 4-6 days 😄
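A minimal illustration of the swap being suggested (variable names are illustrative, not the repo's actual code):

from nltk.tokenize import sent_tokenize

# Before (spaCy): the full nlp pipeline runs on every document
# sentences = [sent.text for sent in nlp(text).sents]

# After (nltk): the lightweight pre-trained punkt sentence splitter
sentences = [s for s in sent_tokenize(text) if len(s) >= 30]  # keeping a min_length-style filter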

@jaklinger

Second note, also applying to the sentence processing: quite often there is an overhead in creating threads, so rather than doing 10,000 operations over 4 cores in 2,500 tasks, you can do 4 × 2,500 operations over 4 cores in 4 tasks. In general, a more practical way to do this is to split the data into chunks and then flatten the output. Potentially you will make a saving of another factor of 10 on the sentence splitting here:

def split_sentence_over_chunk(chunk, nlp, min_length):
    partial_split_sentence = partial(split_sentence, nlp=nlp, min_length=min_length)
    return list(map(partial_split_sentence, chunk))

def make_chunks(lst, n):
    for i in range(0, len(lst), n):
        yield lst[i:i + n]

...

with Pool(4) as pool:  # 4 cpus
    chunks = make_chunks(data, 1000)  # chunks of 1000s sentences
    partial_split_sentence = partial(split_sentence_over_chunk, nlp=nlp, min_length=30)
    # NB the output will be a list of lists, so make sure to flatten after this!
    split_sentence_pool_output = pool.map(partial_split_sentence, chunks)
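And the flattening mentioned in the comment could then be, for example:

from itertools import chain

# pool.map returned one list per chunk, so flatten back into a single list
split_sentence_output = list(chain.from_iterable(split_sentence_pool_output))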

jaklinger commented Oct 27, 2021

General comment: you could get a speed-up of around 100× by switching the pipeline to Metaflow + batch with max-workers=100, whilst splitting the embeddings up into chunks. Something like the following (note: just pseudo-code), which would fan out over files, then fan out again over sentence chunks, and in the end save some data either locally, to S3, or as an S3 artefact, over which you then do your analytic step.

I suspect that this would take just a couple of hours to run for your whole dataset, so even if it would take 5 days to write it would still be worth it, not taking into account additional development cycles of batches of 5 days 😄

from metaflow import FlowSpec, batch, step
from sentence_transformers import SentenceTransformer

# job_ad_file_names, get_sentences, make_chunks and bert_model_name would be defined elsewhere

class SentenceFlow(FlowSpec):
    @step
    def start(self):
        self.file_names = job_ad_file_names
        self.next(self.process_sentences, foreach="file_names")

    @batch()
    @step
    def process_sentences(self):
        self.file_name = self.input
        sentence_data = get_sentences(self.file_name)  # a list of dicts
        self.chunks = make_chunks(sentence_data)
        self.next(self.embedding_chunks, foreach="chunks")

    @batch()
    @step
    def embedding_chunks(self):
        # save on memory with while/pop
        texts, ids = [], []
        while self.input:
            row = self.input.pop(0)
            texts.append(row['text'])
            ids.append(row['ids'])
        bert_model = SentenceTransformer(bert_model_name)
        bert_model.max_seq_length = 512
        vecs = bert_model.encode(texts)
        self.data = list(zip(ids, vecs))
        self.next(self.join_embedding_chunks)

    @step
    def join_embedding_chunks(self, inputs):
        self.data = []
        for inp in inputs:
            self.data += inp.data
        self.next(self.join_files)  # a file-level join/save step would follow (not shown)

... etc ...
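For what it's worth, assuming the flow above were saved as e.g. sentence_flow.py, the fan-out would be driven from the command line with something like python sentence_flow.py run --max-workers 100, and the @batch steps additionally need Metaflow to be configured with AWS Batch.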

lizgzil commented Oct 27, 2021

> First note (another coming I think, but have a meeting now!): nltk's sent_tokenize is 10-100x faster than nlp(...). This should bring us from 7-9 days to 4-6 days 😄

whoa! ok this was much better. Went from 25 secs to 3 secs (on 100 job adverts)

lizgzil marked this pull request as ready for review October 28, 2021 14:53

lizgzil commented Nov 4, 2021

After making some changes, the code actually just took 4.5 days to run


lizgzil commented Nov 5, 2021

There are some files included in the sample which don't contain the full-text metadata. This leaves us with 4,312,285 job adverts, with the following distribution over time (in comparison to all the data files, minus the ones without the full-text metadata):
[Figure: tk_sample_dates_no_expired — sampled job advert dates compared to all data]

lizgzil closed this Nov 5, 2021
lizgzil reopened this Nov 5, 2021
lizgzil merged commit 078e135 into dev Nov 5, 2021
lizgzil deleted the sample_tk_data branch November 5, 2021 16:57